[BUG] Idempotent creates of attached functions fail. #5954

rescrv · 2025-12-05T01:02:44Z

Description of changes

Idempotent creates of attached functions fail because we check the error
and then return an error anyway. This PR adds a carve-out for that error
condition and a test for the sequence that got us into the state where
this was noticed.
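
A minimal sketch of the carve-out, not the coordinator code itself. It assumes a hypothetical in-memory `store` in place of the real DAO, and `ErrAlreadyExists` stands in for `common.ErrAttachedFunctionAlreadyExists`; `attachIdempotent` is an illustrative name.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrAlreadyExists is a stand-in for common.ErrAttachedFunctionAlreadyExists.
var ErrAlreadyExists = errors.New("attached function already exists")

// store is a hypothetical in-memory replacement for the attached-function DAO.
type store struct {
	rows map[string]bool
}

// Insert reports ErrAlreadyExists when a row with this name is already present.
func (s *store) Insert(name string) error {
	if s.rows[name] {
		return ErrAlreadyExists
	}
	s.rows[name] = true
	return nil
}

// attachIdempotent treats a duplicate insert as success and still surfaces
// every other error to the caller.
func attachIdempotent(s *store, name string) error {
	err := s.Insert(name)
	if errors.Is(err, ErrAlreadyExists) {
		return nil // idempotent fall-through
	}
	if err != nil {
		return fmt.Errorf("insert attached function %q: %w", name, err)
	}
	return nil
}
```

Calling `attachIdempotent` twice with the same name returns nil both times, while a direct second `Insert` still reports `ErrAlreadyExists`.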

Test plan

CI

Migration plan

N/A

Observability plan

N/A

Documentation Changes

N/A

github-actions · 2025-12-05T01:02:58Z

propel-code-bot · 2025-12-05T01:03:16Z

Handle idempotent attached function creation across coordinator and DAO

The PR fixes coordinator idempotency when re-attaching functions by ensuring the DAO can retrieve both ready and pending attached functions and by tolerating duplicate insert attempts. A new distributed integration test reproduces the attach-detach-attach sequence to prevent regressions, and Go mocks/interfaces are updated accordingly.

Key Changes

• Add GetReadyOrNotReadyByCollectionID to go/pkg/sysdb/metastore/db/dao/task.go and surface it through IAttachedFunctionDb
• Update go/pkg/sysdb/coordinator/task.go to consult the new DAO method and treat common.ErrAttachedFunctionAlreadyExists as an idempotent success path
• Extend chromadb/test/distributed/test_task_api.py with test_count_function_attach_and_detach_attach_attach covering repeated attach calls after detach
• Refresh generated mocks to include the new DAO method

Affected Areas

• go/pkg/sysdb/coordinator/task.go
• go/pkg/sysdb/metastore/db/dao/task.go
• go/pkg/sysdb/metastore/db/dbmodel/task.go
• chromadb/test/distributed/test_task_api.py
• go/pkg/sysdb/metastore/db/dbmodel/mocks/IAttachedFunctionDb.go

This summary was automatically generated by @propel-code-bot

propel-code-bot · 2025-12-05T01:04:15Z

chromadb/test/distributed/test_task_api.py

+def test_count_function_attach_and_detach_attach_attach(basic_http_client: System) -> None:
+    """Test that re-attaching a function after detach succeeds and that a repeated attach is idempotent"""
+    client = ClientCreator.from_system(basic_http_client)
+    client.reset()
+
+    # Create a collection
+    collection = client.get_or_create_collection(
+        name="my_document",
+        metadata={"description": "Sample documents for task processing"},
+    )
+
+    # Create a task that counts records in the collection
+    attached_fn = collection.attach_function(
+        name="count_my_docs",
+        function_id="record_counter",  # Built-in operator that counts records
+        output_collection="my_documents_counts",
+        params=None,
+    )
+
+    # Verify task creation succeeded
+    assert attached_fn is not None
+    initial_version = get_collection_version(client, collection.name)
+
+    # Add documents
+    collection.add(
+        ids=["doc_{}".format(i) for i in range(0, 300)],
+        documents=["test document"] * 300,
+    )
+
+    # Verify documents were added
+    assert collection.count() == 300
+
+    wait_for_version_increase(client, collection.name, initial_version)
+    # Give some time to invalidate the frontend query cache
+    sleep(60)
+
+    result = client.get_collection("my_documents_counts").get("function_output")
+    assert result["metadatas"] is not None
+    assert result["metadatas"][0]["total_count"] == 300
+
+    # Remove the task
+    success = attached_fn.detach(
+        delete_output_collection=True,
+    )
+
+    # Verify task removal succeeded
+    assert success is True
+
+    # Re-attach the same function after the detach
+    attached_fn = collection.attach_function(
+        name="count_my_docs",
+        function_id="record_counter",  # Built-in operator that counts records
+        output_collection="my_documents_counts",
+        params=None,
+    )
+    assert attached_fn is not None
+
+    # Attach once more with identical arguments; this repeated create must be idempotent
+    attached_fn = collection.attach_function(
+        name="count_my_docs",
+        function_id="record_counter",  # Built-in operator that counts records
+        output_collection="my_documents_counts",
+        params=None,
+    )
+    assert attached_fn is not None


[Maintainability] [CodeDuplication] The new test test_count_function_attach_and_detach_attach_attach is very similar to the existing test_count_function_attach_and_detach test in the same file. A significant portion of the code, including setting up the collection, attaching the function, adding data, and verifying the initial run, is duplicated.

To improve maintainability and reduce redundancy, consider refactoring the common setup and execution logic into a helper function that both tests can call.

File: chromadb/test/distributed/test_task_api.py, line 296

tanujnay112 · 2025-12-05T01:21:33Z

go/pkg/sysdb/coordinator/task.go

 		err = s.catalog.metaDomain.AttachedFunctionDb(txCtx).Insert(attachedFunction)
-		if err != nil {
+		if err == common.ErrAttachedFunctionAlreadyExists {
+			// idempotent fall through


Should we verify that the nonready function that exists actually matches the currently requested function?

Isn't there a bunch of code above that does this?

It does that only for functions that are ready. You might be able to reuse the above checks by changing GetByCollectionID to a function called GetAnyByCollectionID that returns deleted (consistency with other GetAny methods) and non-ready functions.

tanujnay112 · 2025-12-05T22:54:01Z

go/pkg/sysdb/metastore/db/dao/task.go

+	var attachedFunctions []*dbmodel.AttachedFunction
+	err := s.db.
+		Where("input_collection_id = ?", inputCollectionID).
+		Where("is_deleted = ?", false).


Technically, all the other GetAny methods return deleted functions too, so this is an inconsistency. I can add this to my to-clean-up list for cleanup day.

Returning deleted functions breaks this case. What would you call something between GetBy and GetAnyBy?

I've renamed it ReadyOrNotReady and documented it.
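
To make the naming discussion concrete, here is a sketch of the three visibility levels as described in this thread. It is an in-memory illustration, not the GORM query: the `row` type, the `IsReady` field, and the `visibility` constants are assumptions, while the `is_deleted = false` filter for the ReadyOrNotReady case comes from the diff above.

```go
package main

// row is a hypothetical, trimmed-down attached-function record.
type row struct {
	ID        string
	IsReady   bool
	IsDeleted bool
}

// visibility names the three lookup flavors discussed in the thread.
type visibility int

const (
	readyOnly       visibility = iota // GetBy...: ready and not deleted
	readyOrNotReady                   // GetReadyOrNotReadyBy...: not deleted, any readiness
	anyRow                            // GetAny...: everything, including deleted rows
)

// filter returns the rows visible at the requested level.
func filter(rows []row, v visibility) []row {
	var out []row
	for _, r := range rows {
		switch v {
		case readyOnly:
			if r.IsDeleted || !r.IsReady {
				continue
			}
		case readyOrNotReady:
			if r.IsDeleted {
				continue
			}
		case anyRow:
			// no filtering
		}
		out = append(out, r)
	}
	return out
}
```

With one ready row, one pending row, and one deleted row, the three levels return one, two, and three rows respectively, which is why returning deleted rows would break the re-attach-after-detach case.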

tanujnay112 · 2025-12-05T22:55:18Z

go/pkg/sysdb/coordinator/task.go

+		existingAttachedFunctions, err := s.catalog.metaDomain.AttachedFunctionDb(txCtx).GetAnyByCollectionID(req.InputCollectionId)
 		if err != nil {
 			log.Error("AttachFunction: failed to check for existing attached function", zap.Error(err))
 			return err


[Re: line +102]

OK, now this error is wrong. Having more than one ready attached function is a problem, but it is possible to have several partially attached, non-ready functions.

Why would we allow that? That seems like a buggy state to allow.

What if they never become ready and are the cumulative result of multiple failed backfill requests?

That sounds like we're letting our invariants go loose. If there can only be one, better not make them fight for that distinction.

The invariant is that there can be only one ready function. That reminds me that I need to add validation for that before making a function ready in the commit phase of the AttachFunction 2PC.
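
One possible shape for that validation, sketched over an in-memory slice rather than the real transaction. The `attachedFunction` type, `markReady`, and both error values are hypothetical names; the only thing taken from the thread is the invariant that at most one function per collection may be ready.

```go
package main

import "errors"

// attachedFunction is a hypothetical, trimmed-down record.
type attachedFunction struct {
	ID    string
	Ready bool
}

var (
	errAnotherReady = errors.New("another attached function is already ready for this collection")
	errNotFound     = errors.New("attached function not found")
)

// markReady flips the function with the given ID to ready, but only if no
// other function in fns (all rows for one input collection) is already ready.
func markReady(fns []attachedFunction, id string) error {
	target := -1
	for i, fn := range fns {
		switch {
		case fn.ID == id:
			target = i
		case fn.Ready:
			return errAnotherReady // committing would create a second ready function
		}
	}
	if target < 0 {
		return errNotFound
	}
	fns[target].Ready = true // no-op when the target was already ready
	return nil
}
```

In the real commit phase this check and the update would need to run in the same transaction so that two concurrent commits cannot both pass the check.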

Idempotent creates of attached functions fail because we check the error and then throw an error anyway. This PR adds a carve out for that error condition and adds a test of the case that got us into the state where this was noticed.

retrieve any (ready/not-ready) task

Rename ReadyOrNotReady

tanujnay112 · 2025-12-06T01:42:05Z

go/pkg/sysdb/coordinator/task.go

+		existingAttachedFunctions, err := s.catalog.metaDomain.AttachedFunctionDb(txCtx).GetReadyOrNotReadyByCollectionID(req.InputCollectionId)
 		if err != nil {
 			log.Error("AttachFunction: failed to check for existing attached function", zap.Error(err))
 			return err


[Re: line +125]

This line of code should be doing what your change further down intends to do. With your addition of GetReadyOrNotReadyByCollectionID, I wonder whether the right thing to do now is to get rid of the fall-through on line 198.

We also now need to make sure line 121 doesn't error out too early if it happens to read a non-matching unready function first.
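
A sketch of one way to avoid erroring on the first non-matching row: scan every returned function before deciding. This is an illustrative policy, not the coordinator's logic. The `attachedFunction` fields, `sameRequest`, `resolve`, and the three outcomes are hypothetical, and tolerating stale non-ready rows is an assumption that the thread above has not settled.

```go
package main

// attachedFunction is a hypothetical, trimmed-down record.
type attachedFunction struct {
	Name, FunctionID, OutputCollection string
	Ready                              bool
}

type outcome int

const (
	createNew     outcome = iota // nothing matches; insert a new row
	reuseExisting                // an equivalent row exists; treat the call as idempotent
	conflict                     // a different function is already ready
)

// sameRequest compares the fields assumed to identify an attach request.
func sameRequest(a, b attachedFunction) bool {
	return a.Name == b.Name &&
		a.FunctionID == b.FunctionID &&
		a.OutputCollection == b.OutputCollection
}

// resolve inspects all existing rows instead of stopping at the first mismatch.
// Non-matching rows only cause a conflict when they are ready.
func resolve(existing []attachedFunction, req attachedFunction) outcome {
	matched := false
	for _, fn := range existing {
		if sameRequest(fn, req) {
			matched = true
			continue
		}
		if fn.Ready {
			return conflict
		}
	}
	if matched {
		return reuseExisting
	}
	return createNew
}
```

Under this policy a leftover non-ready row from a failed backfill does not block a matching request that appears later in the result set.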

rescrv requested review from Copilot and tanujnay112 and removed request for Copilot December 5, 2025 01:03

Copilot started reviewing on behalf of rescrv December 5, 2025 01:03

propel-code-bot bot reviewed Dec 5, 2025

Copilot finished reviewing on behalf of rescrv December 5, 2025 01:04

tanujnay112 reviewed Dec 5, 2025

propel-code-bot bot reviewed Dec 5, 2025

tanujnay112 reviewed Dec 5, 2025

propel-code-bot bot reviewed Dec 5, 2025

rescrv force-pushed the rescrv/coordinator-idempotent branch from 81e1ce2 to 6f1db5f Compare December 5, 2025 23:07

propel-code-bot bot reviewed Dec 5, 2025

rescrv added 2 commits December 5, 2025 15:42

typo

6882911

rescrv force-pushed the rescrv/coordinator-idempotent branch from 6f1db5f to 6882911 Compare December 5, 2025 23:43

Update mocks

5e60d8a

tanujnay112 reviewed Dec 6, 2025

rescrv closed this Dec 8, 2025
